As part of the main problem tackled in this notebook, this section addresses the following questions:
How many online customers are there in the dataset and what is their country of origin?
What are the countries that are most represented in the dataset?
How much revenue was made in each month, and what percentage of revenue does each country contribute?
# importing necessary Python libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.offline as pyoff
import plotly.graph_objs as go
# avoid displaying warnings
import warnings
warnings.filterwarnings("ignore")
#import machine learning related libraries
from sklearn.svm import SVC
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split, GridSearchCV
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.cluster import KMeans
import xgboost as xgb
# Loading the data
df = pd.read_csv('online_retail_II.csv')
df.head()
| Invoice | StockCode | Description | Quantity | InvoiceDate | Price | Customer ID | Country | |
|---|---|---|---|---|---|---|---|---|
| 0 | 489434 | 85048 | 15CM CHRISTMAS GLASS BALL 20 LIGHTS | 12 | 2009-12-01 07:45:00 | 6.95 | 13085.0 | United Kingdom |
| 1 | 489434 | 79323P | PINK CHERRY LIGHTS | 12 | 2009-12-01 07:45:00 | 6.75 | 13085.0 | United Kingdom |
| 2 | 489434 | 79323W | WHITE CHERRY LIGHTS | 12 | 2009-12-01 07:45:00 | 6.75 | 13085.0 | United Kingdom |
| 3 | 489434 | 22041 | RECORD FRAME 7" SINGLE SIZE | 48 | 2009-12-01 07:45:00 | 2.10 | 13085.0 | United Kingdom |
| 4 | 489434 | 21232 | STRAWBERRY CERAMIC TRINKET BOX | 24 | 2009-12-01 07:45:00 | 1.25 | 13085.0 | United Kingdom |
In this section, we will explore the given dataset to answer some general questions about it. Along the way, we will introduce additional features that will later help us build a machine learning model to answer our main question.
Let's update the pandas dataframe df by renaming some of its columns.
# Rename the following columns:
# Invoice to InvoiceNo
# Customer ID to CustomerID
# Price to UnitPrice
df.rename(columns={'Invoice':'InvoiceNo', 'Customer ID':'CustomerID', 'Price':'UnitPrice'}, inplace=True)
df.head()
| InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country | |
|---|---|---|---|---|---|---|---|---|
| 0 | 489434 | 85048 | 15CM CHRISTMAS GLASS BALL 20 LIGHTS | 12 | 2009-12-01 07:45:00 | 6.95 | 13085.0 | United Kingdom |
| 1 | 489434 | 79323P | PINK CHERRY LIGHTS | 12 | 2009-12-01 07:45:00 | 6.75 | 13085.0 | United Kingdom |
| 2 | 489434 | 79323W | WHITE CHERRY LIGHTS | 12 | 2009-12-01 07:45:00 | 6.75 | 13085.0 | United Kingdom |
| 3 | 489434 | 22041 | RECORD FRAME 7" SINGLE SIZE | 48 | 2009-12-01 07:45:00 | 2.10 | 13085.0 | United Kingdom |
| 4 | 489434 | 21232 | STRAWBERRY CERAMIC TRINKET BOX | 24 | 2009-12-01 07:45:00 | 1.25 | 13085.0 | United Kingdom |
We can check information about the dataframe with the info method.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1067371 entries, 0 to 1067370
Data columns (total 8 columns):
 #   Column       Non-Null Count    Dtype
---  ------       --------------    -----
 0   InvoiceNo    1067371 non-null  object
 1   StockCode    1067371 non-null  object
 2   Description  1062989 non-null  object
 3   Quantity     1067371 non-null  int64
 4   InvoiceDate  1067371 non-null  object
 5   UnitPrice    1067371 non-null  float64
 6   CustomerID   824364 non-null   float64
 7   Country      1067371 non-null  object
dtypes: float64(2), int64(1), object(5)
memory usage: 65.1+ MB
From the output of the info method, we can see that the CustomerID and Description columns contain missing values. We may also want to convert the entries of the InvoiceDate column from strings to proper datetime objects with the to_datetime method.
Let's count the number of missing values in each column of the dataframe.
df.isnull().sum()
InvoiceNo           0
StockCode           0
Description      4382
Quantity            0
InvoiceDate         0
UnitPrice           0
CustomerID     243007
Country             0
dtype: int64
As predicted, the columns CustomerID and Description have $243007$ and $4382$ missing values respectively. We drop the rows containing missing values using the dropna method.
df_data = df.dropna()
df_data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 824364 entries, 0 to 1067370
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype
---  ------       --------------   -----
 0   InvoiceNo    824364 non-null  object
 1   StockCode    824364 non-null  object
 2   Description  824364 non-null  object
 3   Quantity     824364 non-null  int64
 4   InvoiceDate  824364 non-null  object
 5   UnitPrice    824364 non-null  float64
 6   CustomerID   824364 non-null  float64
 7   Country      824364 non-null  object
dtypes: float64(2), int64(1), object(5)
memory usage: 56.6+ MB
Next, we update the dataframe df_data by converting the date field InvoiceDate to datetime objects. One can use the to_datetime method to achieve this.
df_data['InvoiceDate'] = pd.to_datetime(df_data['InvoiceDate'])
df_data.head()
| InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country | |
|---|---|---|---|---|---|---|---|---|
| 0 | 489434 | 85048 | 15CM CHRISTMAS GLASS BALL 20 LIGHTS | 12 | 2009-12-01 07:45:00 | 6.95 | 13085.0 | United Kingdom |
| 1 | 489434 | 79323P | PINK CHERRY LIGHTS | 12 | 2009-12-01 07:45:00 | 6.75 | 13085.0 | United Kingdom |
| 2 | 489434 | 79323W | WHITE CHERRY LIGHTS | 12 | 2009-12-01 07:45:00 | 6.75 | 13085.0 | United Kingdom |
| 3 | 489434 | 22041 | RECORD FRAME 7" SINGLE SIZE | 48 | 2009-12-01 07:45:00 | 2.10 | 13085.0 | United Kingdom |
| 4 | 489434 | 21232 | STRAWBERRY CERAMIC TRINKET BOX | 24 | 2009-12-01 07:45:00 | 1.25 | 13085.0 | United Kingdom |
pd.DataFrame(df_data.InvoiceDate.describe())
| InvoiceDate | |
|---|---|
| count | 824364 |
| unique | 41439 |
| top | 2011-11-14 15:27:00 |
| freq | 543 |
| first | 2009-12-01 07:45:00 |
| last | 2011-12-09 12:50:00 |
From the above, we see that the online purchases made by customers span from 2009-12-01 to 2011-12-09.
In the subsequent cells, we will answer some questions we would like to know from the given dataset.
How many online customers are there in the dataset and what is their country of origin?
print('From the dataset, the online retail shop has {} customers \
from {} different countries.'.format(len(df_data.CustomerID.unique()), len(df_data.Country.unique())))
From the dataset, the online retail shop has 5942 customers from 41 different countries.
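Equivalently, pandas' nunique counts distinct values directly, without building the intermediate array from unique. A minimal sketch on toy data (the toy frame below is illustrative, not from the dataset):

```python
import pandas as pd

# Toy frame mirroring the CustomerID and Country columns
toy = pd.DataFrame({
    "CustomerID": [1.0, 1.0, 2.0, 3.0],
    "Country": ["United Kingdom", "United Kingdom", "France", "United Kingdom"],
})

n_customers = toy.CustomerID.nunique()  # number of distinct customers
n_countries = toy.Country.nunique()     # number of distinct countries
print(n_customers, n_countries)         # 3 2
```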
ctm_cntry_df = df_data.groupby(['CustomerID', 'Country']).count().reset_index()
ctm_cntry_df = ctm_cntry_df.groupby('Country')['CustomerID'].count().reset_index().sort_values(
by=['CustomerID'], ascending=False)
# Create a new column, Percentage to calculate the customer representation in percentage
ctm_cntry_df['Percentage']= np.round(ctm_cntry_df.CustomerID / ctm_cntry_df.CustomerID.sum() * 100, 2)
ctm_cntry_df.head(10)
| Country | CustomerID | Percentage | |
|---|---|---|---|
| 38 | United Kingdom | 5410 | 90.85 |
| 14 | Germany | 107 | 1.80 |
| 13 | France | 95 | 1.60 |
| 32 | Spain | 41 | 0.69 |
| 3 | Belgium | 29 | 0.49 |
| 28 | Portugal | 24 | 0.40 |
| 24 | Netherlands | 23 | 0.39 |
| 34 | Switzerland | 22 | 0.37 |
| 33 | Sweden | 19 | 0.32 |
| 18 | Italy | 17 | 0.29 |
The output above shows the top $10$ countries by number of customers purchasing from the online retail shop, with the United Kingdom having the most customers. They represent about $90.85\%$ of all online customers.
Although $13$ online customers are recorded under two countries, this does not significantly affect the percentage values recorded. For these $13$ online customers, the country recorded for their most recent purchase is likely where they currently live.
# Initialise a dict to collect all customers recorded under more than one country
ctm_2_cntry = {}
for cid in df_data.CustomerID.unique():
    # Get the countries recorded for this customer
    cntry = df_data[df_data.CustomerID == cid].Country.unique()
    # Keep the customer if more than one country is recorded
    if len(cntry) > 1:
        ctm_2_cntry[cid] = cntry
pd.DataFrame(ctm_2_cntry)  # Create a pandas dataframe from ctm_2_cntry
| 12417.0 | 12455.0 | 12422.0 | 12431.0 | 12745.0 | 12370.0 | 12429.0 | 12423.0 | 12449.0 | 12457.0 | 12652.0 | 12413.0 | 12394.0 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Belgium | Cyprus | Australia | Belgium | EIRE | Austria | Denmark | Denmark | Denmark | Switzerland | Germany | Spain | Belgium |
| 1 | Spain | Spain | Switzerland | Australia | United Kingdom | Cyprus | Austria | Belgium | Belgium | Cyprus | France | France | Denmark |
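The loop above can also be vectorized with groupby and nunique, which avoids one filter pass per customer. A small sketch on toy data (the toy frame is illustrative, not from the dataset):

```python
import pandas as pd

# Toy purchases: customer 2 appears under two different countries
toy = pd.DataFrame({
    "CustomerID": [1.0, 1.0, 2.0, 2.0, 3.0],
    "Country": ["United Kingdom", "United Kingdom", "Belgium", "Spain", "France"],
})

# Count distinct countries per customer, then keep those with more than one
countries_per_ctm = toy.groupby("CustomerID").Country.nunique()
multi_country = countries_per_ctm[countries_per_ctm > 1]
print(multi_country.index.tolist())  # [2.0]
```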
In the next two cells, we group countries whose customer percentage is less than or equal to $0.25$ into 'Other Countries' and then plot a pie chart.
percent_margin = 0.25
# Create a new column, CountryCategory and set values to the corresponding values of the Country column
ctm_cntry_df['CountryCategory'] = ctm_cntry_df.Country
# Set Countries with Percentage less than or equal to percent_margin to 'Other Countries'
ctm_cntry_df.loc[ctm_cntry_df.Percentage <= percent_margin, 'CountryCategory'] = 'Other Countries'
ctm_cntry_df.head(11)
| Country | CustomerID | Percentage | CountryCategory | |
|---|---|---|---|---|
| 38 | United Kingdom | 5410 | 90.85 | United Kingdom |
| 14 | Germany | 107 | 1.80 | Germany |
| 13 | France | 95 | 1.60 | France |
| 32 | Spain | 41 | 0.69 | Spain |
| 3 | Belgium | 29 | 0.49 | Belgium |
| 28 | Portugal | 24 | 0.40 | Portugal |
| 24 | Netherlands | 23 | 0.39 | Netherlands |
| 34 | Switzerland | 22 | 0.37 | Switzerland |
| 33 | Sweden | 19 | 0.32 | Sweden |
| 18 | Italy | 17 | 0.29 | Italy |
| 12 | Finland | 15 | 0.25 | Other Countries |
plot_data = [
go.Pie(
labels=ctm_cntry_df.CountryCategory,
values=ctm_cntry_df.Percentage,
)
]
plot_layout = go.Layout(
title='Customer Country Count in Percentage.',
legend_title="Countries Represented"
)
fig = go.Figure(
data=plot_data, layout=plot_layout
)
pyoff.iplot(fig)
How much revenue was made in each month, and what is the percentage revenue by country?
df_data['InvoiceYearMonth'] = df_data['InvoiceDate'].map(lambda date: 100*date.year + date.month)
df_data.head()
| InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country | InvoiceYearMonth | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 489434 | 85048 | 15CM CHRISTMAS GLASS BALL 20 LIGHTS | 12 | 2009-12-01 07:45:00 | 6.95 | 13085.0 | United Kingdom | 200912 |
| 1 | 489434 | 79323P | PINK CHERRY LIGHTS | 12 | 2009-12-01 07:45:00 | 6.75 | 13085.0 | United Kingdom | 200912 |
| 2 | 489434 | 79323W | WHITE CHERRY LIGHTS | 12 | 2009-12-01 07:45:00 | 6.75 | 13085.0 | United Kingdom | 200912 |
| 3 | 489434 | 22041 | RECORD FRAME 7" SINGLE SIZE | 48 | 2009-12-01 07:45:00 | 2.10 | 13085.0 | United Kingdom | 200912 |
| 4 | 489434 | 21232 | STRAWBERRY CERAMIC TRINKET BOX | 24 | 2009-12-01 07:45:00 | 1.25 | 13085.0 | United Kingdom | 200912 |
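The lambda above encodes a year-month pair as a single integer via 100*year + month. The same encoding can be written without a Python-level loop using the vectorized .dt accessor; a small sketch on a toy series:

```python
import pandas as pd

# Toy datetime series standing in for the InvoiceDate column
dates = pd.to_datetime(pd.Series(["2009-12-01 07:45:00", "2011-11-14 15:27:00"]))

# 100*year + month, computed column-wise via the .dt accessor
year_month = dates.dt.year * 100 + dates.dt.month
print(year_month.tolist())  # [200912, 201111]
```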
df_data['Revenue'] = df_data.UnitPrice * df_data.Quantity
df_data.head()
| InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country | InvoiceYearMonth | Revenue | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 489434 | 85048 | 15CM CHRISTMAS GLASS BALL 20 LIGHTS | 12 | 2009-12-01 07:45:00 | 6.95 | 13085.0 | United Kingdom | 200912 | 83.4 |
| 1 | 489434 | 79323P | PINK CHERRY LIGHTS | 12 | 2009-12-01 07:45:00 | 6.75 | 13085.0 | United Kingdom | 200912 | 81.0 |
| 2 | 489434 | 79323W | WHITE CHERRY LIGHTS | 12 | 2009-12-01 07:45:00 | 6.75 | 13085.0 | United Kingdom | 200912 | 81.0 |
| 3 | 489434 | 22041 | RECORD FRAME 7" SINGLE SIZE | 48 | 2009-12-01 07:45:00 | 2.10 | 13085.0 | United Kingdom | 200912 | 100.8 |
| 4 | 489434 | 21232 | STRAWBERRY CERAMIC TRINKET BOX | 24 | 2009-12-01 07:45:00 | 1.25 | 13085.0 | United Kingdom | 200912 | 30.0 |
ctm_revenue = df_data.groupby('InvoiceYearMonth').Revenue.sum().reset_index()
ctm_revenue.head()
| InvoiceYearMonth | Revenue | |
|---|---|---|
| 0 | 200912 | 663272.050 |
| 1 | 201001 | 531952.902 |
| 2 | 201002 | 489399.586 |
| 3 | 201003 | 635996.481 |
| 4 | 201004 | 560635.022 |
pd.DataFrame(ctm_revenue.Revenue.describe())
| Revenue | |
|---|---|
| count | 2.500000e+01 |
| mean | 6.659317e+05 |
| std | 2.154966e+05 |
| min | 3.425244e+05 |
| 25% | 5.599246e+05 |
| 50% | 5.872565e+05 |
| 75% | 7.810333e+05 |
| max | 1.134879e+06 |
plot_data = [
go.Scatter(
x=ctm_revenue['InvoiceYearMonth'],
y=ctm_revenue['Revenue'],
)
]
plot_layout = go.Layout(
xaxis={"type": "category"},
title='Monthly Revenue from Dec. 2009 to Dec. 2011',
xaxis_title="Invoice Year-Month",
yaxis_title="Monthly Revenue",
)
fig = go.Figure(
data=plot_data,
layout=plot_layout
)
pyoff.iplot(fig)
From the figure above, we observe that the online retail shop made the highest sales in November 2010, followed by November 2011. One can also observe that monthly sales rise after August.
Next, we explore the percentage revenue generated by the retail shop based on the countries in which its customers reside.
cntry_revenue_df = df_data.groupby(['Country']).Revenue.sum().reset_index().sort_values(by=['Revenue'],
ascending=False)
cntry_revenue_df['Percentage'] = np.round(cntry_revenue_df.Revenue / cntry_revenue_df.Revenue.sum() * 100, 2)
cntry_revenue_df.head(5)
| Country | Revenue | Percentage | |
|---|---|---|---|
| 38 | United Kingdom | 1.380642e+07 | 82.93 |
| 10 | EIRE | 5.785016e+05 | 3.47 |
| 24 | Netherlands | 5.485250e+05 | 3.29 |
| 14 | Germany | 4.179886e+05 | 2.51 |
| 13 | France | 3.265047e+05 | 1.96 |
From the output above, the top $5$ countries with respect to revenue generated are the United Kingdom, EIRE, the Netherlands, Germany and France, with the United Kingdom recording the highest percentage $(82.93\%)$.
In the next two cells, we update cntry_revenue_df with a new column, CountryCategory, grouping countries whose revenue percentage is less than or equal to $0.25$ into 'Other Countries', and then plot a pie chart.
percent_margin = 0.25
# Create a new column, CountryCategory and set values to the corresponding values of the Country column
cntry_revenue_df['CountryCategory'] = cntry_revenue_df.Country
# Set Countries with Percentage less than or equal to percent_margin to 'Other Countries'
cntry_revenue_df.loc[cntry_revenue_df.Percentage <= percent_margin, 'CountryCategory'] = 'Other Countries'
cntry_revenue_df.head(11)
| Country | Revenue | Percentage | CountryCategory | |
|---|---|---|---|---|
| 38 | United Kingdom | 1.380642e+07 | 82.93 | United Kingdom |
| 10 | EIRE | 5.785016e+05 | 3.47 | EIRE |
| 24 | Netherlands | 5.485250e+05 | 3.29 | Netherlands |
| 14 | Germany | 4.179886e+05 | 2.51 | Germany |
| 13 | France | 3.265047e+05 | 1.96 | France |
| 0 | Australia | 1.671291e+05 | 1.00 | Australia |
| 34 | Switzerland | 9.908281e+04 | 0.60 | Switzerland |
| 32 | Spain | 9.185948e+04 | 0.55 | Spain |
| 33 | Sweden | 8.745542e+04 | 0.53 | Sweden |
| 9 | Denmark | 6.574109e+04 | 0.39 | Denmark |
| 3 | Belgium | 6.357449e+04 | 0.38 | Belgium |
plot_data = [
go.Pie(
labels=cntry_revenue_df.CountryCategory,
values=cntry_revenue_df.Percentage,
)
]
plot_layout = go.Layout(
title='Country Revenue in Percentage',
legend_title = 'Countries Represented'
)
fig = go.Figure(
data=plot_data, layout=plot_layout
)
pyoff.iplot(fig)
The goal of this section is to build a model, using the given dataframe df_data, to estimate whether a given customer will buy something again from the online shop in the next quarter.
The dataframe is split into two parts.
The first sub-dataframe, assigned to the Python variable ctm_bhvr_dt, contains purchases made by customers from 2009-12-01 to 2011-08-31. This dataframe will be used to study the purchasing behaviour of the online customers.
The second sub-dataframe, assigned to the Python variable ctm_next_quarter, will be used to study the purchasing behaviour of the customers in the next quarter, that is, from 2011-09-01 to 2011-11-30.
df_data.head()
| InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country | InvoiceYearMonth | Revenue | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 489434 | 85048 | 15CM CHRISTMAS GLASS BALL 20 LIGHTS | 12 | 2009-12-01 07:45:00 | 6.95 | 13085.0 | United Kingdom | 200912 | 83.4 |
| 1 | 489434 | 79323P | PINK CHERRY LIGHTS | 12 | 2009-12-01 07:45:00 | 6.75 | 13085.0 | United Kingdom | 200912 | 81.0 |
| 2 | 489434 | 79323W | WHITE CHERRY LIGHTS | 12 | 2009-12-01 07:45:00 | 6.75 | 13085.0 | United Kingdom | 200912 | 81.0 |
| 3 | 489434 | 22041 | RECORD FRAME 7" SINGLE SIZE | 48 | 2009-12-01 07:45:00 | 2.10 | 13085.0 | United Kingdom | 200912 | 100.8 |
| 4 | 489434 | 21232 | STRAWBERRY CERAMIC TRINKET BOX | 24 | 2009-12-01 07:45:00 | 1.25 | 13085.0 | United Kingdom | 200912 | 30.0 |
ctm_bhvr_dt = df_data[(df_data.InvoiceDate < pd.Timestamp(2011,9,1)) &
(df_data.InvoiceDate >= pd.Timestamp(2009,12,1))].reset_index(drop=True)
ctm_next_quarter = df_data[(df_data.InvoiceDate < pd.Timestamp(2011,12,1)) &
(df_data.InvoiceDate >= pd.Timestamp(2011,9,1))].reset_index(drop=True)
ctm_dt = pd.DataFrame(ctm_bhvr_dt['CustomerID'].unique())
ctm_dt.columns = ['CustomerID']
ctm_dt.head()
| CustomerID | |
|---|---|
| 0 | 13085.0 |
| 1 | 13078.0 |
| 2 | 15362.0 |
| 3 | 18102.0 |
| 4 | 12682.0 |
Let's find the first purchase made by each customer in the next quarter.
# Create a dataframe with CustomerID and customers first purchase
# date in ctm_next_quarter
ctm_1st_purchase_in_next_quarter = ctm_next_quarter.groupby('CustomerID').InvoiceDate.min().reset_index()
ctm_1st_purchase_in_next_quarter.columns = ['CustomerID','MinPurchaseDate']
ctm_1st_purchase_in_next_quarter.head()
| CustomerID | MinPurchaseDate | |
|---|---|---|
| 0 | 12347.0 | 2011-10-31 12:25:00 |
| 1 | 12348.0 | 2011-09-25 13:13:00 |
| 2 | 12349.0 | 2011-11-21 09:51:00 |
| 3 | 12352.0 | 2011-09-20 14:34:00 |
| 4 | 12356.0 | 2011-11-17 08:40:00 |
Let's find the last purchase made by each customer in the dataframe ctm_bhvr_dt.
ctm_last_purchase_bhvr_dt = ctm_bhvr_dt.groupby('CustomerID').InvoiceDate.max().reset_index()
ctm_last_purchase_bhvr_dt.columns = ['CustomerID','MaxPurchaseDate']
ctm_last_purchase_bhvr_dt.head()
| CustomerID | MaxPurchaseDate | |
|---|---|---|
| 0 | 12346.0 | 2011-01-18 10:17:00 |
| 1 | 12347.0 | 2011-08-02 08:48:00 |
| 2 | 12348.0 | 2011-04-05 10:47:00 |
| 3 | 12349.0 | 2010-10-28 08:23:00 |
| 4 | 12350.0 | 2011-02-02 16:01:00 |
Let's merge the two dataframes ctm_last_purchase_bhvr_dt and ctm_1st_purchase_in_next_quarter.
# Merge two dataframes ctm_last_purchase_bhvr_dt and ctm_1st_purchase_in_next_quarter
ctm_purchase_dates = pd.merge(ctm_last_purchase_bhvr_dt, ctm_1st_purchase_in_next_quarter, on='CustomerID',
how='left')
ctm_purchase_dates.head()
| CustomerID | MaxPurchaseDate | MinPurchaseDate | |
|---|---|---|---|
| 0 | 12346.0 | 2011-01-18 10:17:00 | NaT |
| 1 | 12347.0 | 2011-08-02 08:48:00 | 2011-10-31 12:25:00 |
| 2 | 12348.0 | 2011-04-05 10:47:00 | 2011-09-25 13:13:00 |
| 3 | 12349.0 | 2010-10-28 08:23:00 | 2011-11-21 09:51:00 |
| 4 | 12350.0 | 2011-02-02 16:01:00 | NaT |
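The how='left' join keeps every customer from ctm_last_purchase_bhvr_dt; customers with no purchase in the next quarter simply get NaT in MinPurchaseDate. A toy sketch of that behaviour (the toy frames are illustrative, not from the dataset):

```python
import pandas as pd

# Customer 1 has no purchase in the "next quarter" frame
left = pd.DataFrame({
    "CustomerID": [1, 2],
    "MaxPurchaseDate": pd.to_datetime(["2011-01-18", "2011-08-02"]),
})
right = pd.DataFrame({
    "CustomerID": [2],
    "MinPurchaseDate": pd.to_datetime(["2011-10-31"]),
})

# Left join: unmatched rows get NaT in the right-hand column
merged = pd.merge(left, right, on="CustomerID", how="left")
print(merged.MinPurchaseDate.isna().tolist())  # [True, False]
```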
Let's calculate the time difference in days between each customer's last purchase in the dataframe ctm_last_purchase_bhvr_dt and their first purchase in the dataframe ctm_1st_purchase_in_next_quarter.
ctm_purchase_dates['NextPurchaseDay'] = (ctm_purchase_dates['MinPurchaseDate'] - ctm_purchase_dates['MaxPurchaseDate']).dt.days
ctm_purchase_dates.head()
| CustomerID | MaxPurchaseDate | MinPurchaseDate | NextPurchaseDay | |
|---|---|---|---|---|
| 0 | 12346.0 | 2011-01-18 10:17:00 | NaT | NaN |
| 1 | 12347.0 | 2011-08-02 08:48:00 | 2011-10-31 12:25:00 | 90.0 |
| 2 | 12348.0 | 2011-04-05 10:47:00 | 2011-09-25 13:13:00 | 173.0 |
| 3 | 12349.0 | 2010-10-28 08:23:00 | 2011-11-21 09:51:00 | 389.0 |
| 4 | 12350.0 | 2011-02-02 16:01:00 | NaT | NaN |
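Subtracting two datetime columns yields a Timedelta series, and .dt.days truncates each difference to whole days. A small sketch reproducing the 90-day gap seen for customer 12347.0 above:

```python
import pandas as pd

last = pd.to_datetime(pd.Series(["2011-08-02 08:48:00"]))
first_next = pd.to_datetime(pd.Series(["2011-10-31 12:25:00"]))

# The difference is a Timedelta series; .dt.days keeps whole days only
gap_days = (first_next - last).dt.days
print(gap_days.tolist())  # [90]
```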
# merge with ctm_dt
ctm_dt = pd.merge(ctm_dt, ctm_purchase_dates[['CustomerID','NextPurchaseDay']], on='CustomerID', how='left')
ctm_dt.head()
| CustomerID | NextPurchaseDay | |
|---|---|---|
| 0 | 13085.0 | NaN |
| 1 | 13078.0 | 13.0 |
| 2 | 15362.0 | NaN |
| 3 | 18102.0 | 27.0 |
| 4 | 12682.0 | 15.0 |
Update the dataframe ctm_dt by filling all missing values (customers with no purchase in the next quarter) with $9999$.
ctm_dt = ctm_dt.fillna(9999)
ctm_dt.head()
| CustomerID | NextPurchaseDay | |
|---|---|---|
| 0 | 13085.0 | 9999.0 |
| 1 | 13078.0 | 13.0 |
| 2 | 15362.0 | 9999.0 |
| 3 | 18102.0 | 27.0 |
| 4 | 12682.0 | 15.0 |
Next, we will define some features and add them to the dataframe ctm_dt to build our machine learning model. We will use the Recency - Frequency - Monetary Value (RFM) segmentation method. That is, we will put the customers into groups based on the following:
Recency: Customers' purchase behaviour based on their most recent purchase date and how many days they have been inactive since their last purchase.
Frequency: Customers' purchase behaviour based on the number of times they buy from the online retail shop.
Monetary Value/Revenue: Customers' purchase behaviour based on the revenue they generate.
Afterwards, we will apply K-means clustering to assign customers a score for each of these features.
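Before the step-by-step construction below, the three RFM quantities can be sketched compactly on toy data with a single groupby aggregation (the toy frame and the snapshot date are illustrative assumptions, not the notebook's exact code):

```python
import pandas as pd

# Toy purchase log with the columns the RFM features are built from
toy = pd.DataFrame({
    "CustomerID": [1, 1, 2],
    "InvoiceDate": pd.to_datetime(["2011-01-01", "2011-03-01", "2011-02-01"]),
    "Revenue": [10.0, 20.0, 5.0],
})

snapshot = toy.InvoiceDate.max()  # reference date for recency
rfm = toy.groupby("CustomerID").agg(
    Recency=("InvoiceDate", lambda d: (snapshot - d.max()).days),  # days inactive
    Frequency=("InvoiceDate", "count"),                            # number of orders
    Revenue=("Revenue", "sum"),                                    # total spend
).reset_index()
print(rfm)
```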
Let's find the most recent purchase date of each customer and see how many days they have been inactive. Afterwards, we can apply K-means clustering to assign customers a recency score.
ctm_max_purchase = ctm_bhvr_dt.groupby('CustomerID').InvoiceDate.max().reset_index()
ctm_max_purchase.columns = ['CustomerID','MaxPurchaseDate']
ctm_max_purchase.head()
| CustomerID | MaxPurchaseDate | |
|---|---|---|
| 0 | 12346.0 | 2011-01-18 10:17:00 |
| 1 | 12347.0 | 2011-08-02 08:48:00 |
| 2 | 12348.0 | 2011-04-05 10:47:00 |
| 3 | 12349.0 | 2010-10-28 08:23:00 |
| 4 | 12350.0 | 2011-02-02 16:01:00 |
#find the recency in days and add it to ctm_dt
ctm_max_purchase['Recency'] = (ctm_max_purchase['MaxPurchaseDate'].max() - ctm_max_purchase['MaxPurchaseDate']).dt.days
ctm_dt = pd.merge(ctm_dt, ctm_max_purchase[['CustomerID', 'Recency']], on='CustomerID')
ctm_dt.head()
| CustomerID | NextPurchaseDay | Recency | |
|---|---|---|---|
| 0 | 13085.0 | 9999.0 | 57 |
| 1 | 13078.0 | 13.0 | 0 |
| 2 | 15362.0 | 9999.0 | 348 |
| 3 | 18102.0 | 27.0 | 26 |
| 4 | 12682.0 | 15.0 | 0 |
pd.DataFrame(ctm_dt.Recency.describe())
| Recency | |
|---|---|
| count | 5314.000000 |
| mean | 204.675574 |
| std | 173.569622 |
| min | 0.000000 |
| 25% | 48.000000 |
| 50% | 161.000000 |
| 75% | 320.000000 |
| max | 638.000000 |
The mean Recency is approximately $205$ days while the median is $161$ days.
#plot recency
plot_data = [
go.Histogram(
x=ctm_dt['Recency']
)
]
plot_layout = go.Layout(
title="Customers Recency in Days",
xaxis_title="Recency in groups of 20 days",
yaxis_title="Number of Customers"
)
fig = go.Figure(
data=plot_data,
layout=plot_layout
)
pyoff.iplot(fig)
Next, we will apply K-means clustering to assign a recency score. However, we need to decide how many clusters to use before running the K-means algorithm. We will use the Elbow Method: plot the inertia (within-cluster sum of squared distances) against the number of clusters and pick the point after which adding more clusters no longer yields a substantial drop.
sse = {}
ctm_recency = ctm_dt[['Recency']]
# Fit K-means for 1 to 9 clusters and record the inertia for each
for k in range(1, 10):
    kmeans = KMeans(n_clusters=k, max_iter=1000).fit(ctm_recency)
    sse[k] = kmeans.inertia_
plt.figure()
plt.plot(list(sse.keys()), list(sse.values()))
plt.xlabel("Number of clusters")
plt.ylabel("Inertia")
plt.show()
From the figure above, $4$ seems to be the optimal number of clusters.
number_of_clusters = 4
Let's build $4$ clusters for recency and add the cluster labels to the dataframe ctm_dt.
kmeans = KMeans(n_clusters=number_of_clusters)
kmeans.fit(ctm_dt[['Recency']])
ctm_dt['RecencyCluster'] = kmeans.predict(ctm_dt[['Recency']])
ctm_dt.head()
| CustomerID | NextPurchaseDay | Recency | RecencyCluster | |
|---|---|---|---|---|
| 0 | 13085.0 | 9999.0 | 57 | 2 |
| 1 | 13078.0 | 13.0 | 0 | 2 |
| 2 | 15362.0 | 9999.0 | 348 | 3 |
| 3 | 18102.0 | 27.0 | 26 | 2 |
| 4 | 12682.0 | 15.0 | 0 | 2 |
def order_cluster(df, target_field_name, cluster_field_name, ascending):
    """
    INPUT:
    - df - pandas DataFrame
    - target_field_name - str - A column in the pandas DataFrame df
    - cluster_field_name - str - Expected to be a column in the pandas DataFrame df
    - ascending - bool - Sort order for the cluster means
    OUTPUT:
    - df_final - pandas DataFrame with the values of cluster_field_name relabelled
      so that cluster labels follow the ordering of the mean of target_field_name
    """
    # Group by cluster_field_name and compute the mean of target_field_name
    df_new = df.groupby(cluster_field_name)[target_field_name].mean().reset_index()
    # Sort the cluster means in the requested order
    df_new = df_new.sort_values(by=target_field_name, ascending=ascending).reset_index(drop=True)
    # The new, ordered cluster label is simply the row position after sorting
    df_new["index"] = df_new.index
    # Merge the old-to-new label mapping back onto the input dataframe
    df_final = pd.merge(df, df_new[[cluster_field_name, "index"]], on=cluster_field_name)
    # Drop the old cluster column and rename index to cluster_field_name
    df_final = df_final.drop([cluster_field_name], axis=1)
    df_final = df_final.rename(columns={"index": cluster_field_name})
    return df_final
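The relabelling idea behind order_cluster can be checked on toy data in a few lines: rank the raw cluster labels by the mean of the target (descending here, as used for Recency below) and remap. This is a minimal self-contained sketch of the same idea, not the function itself:

```python
import pandas as pd

# Toy data: raw K-means labels [2, 2, 0, 0, 1] are arbitrary
toy = pd.DataFrame({
    "Recency": [500, 510, 10, 20, 150],
    "RecencyCluster": [2, 2, 0, 0, 1],
})

# Order cluster labels by mean Recency, descending: highest mean -> label 0
means = toy.groupby("RecencyCluster").Recency.mean().sort_values(ascending=False)
mapping = {old: new for new, old in enumerate(means.index)}
toy["RecencyCluster"] = toy["RecencyCluster"].map(mapping)
print(toy.RecencyCluster.tolist())  # [0, 0, 2, 2, 1]
```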
ctm_dt = order_cluster(ctm_dt, 'Recency', 'RecencyCluster', False)
ctm_dt.head()
| CustomerID | NextPurchaseDay | Recency | RecencyCluster | |
|---|---|---|---|---|
| 0 | 13085.0 | 9999.0 | 57 | 3 |
| 1 | 13078.0 | 13.0 | 0 | 3 |
| 2 | 18102.0 | 27.0 | 26 | 3 |
| 3 | 12682.0 | 15.0 | 0 | 3 |
| 4 | 18087.0 | 46.0 | 44 | 3 |
#print cluster characteristics
ctm_dt.groupby('RecencyCluster')['Recency'].describe()
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| RecencyCluster | ||||||||
| 0 | 750.0 | 520.509333 | 65.218473 | 418.0 | 465.0 | 517.0 | 571.5 | 638.0 |
| 1 | 1427.0 | 314.360196 | 41.129298 | 236.0 | 281.5 | 308.0 | 338.0 | 416.0 |
| 2 | 1090.0 | 156.211927 | 38.996363 | 98.0 | 119.0 | 153.0 | 187.0 | 234.0 |
| 3 | 2047.0 | 38.300440 | 27.999147 | 0.0 | 13.0 | 33.0 | 61.0 | 97.0 |
Observe from the above that cluster $3$ covers the most recent customers, whereas cluster $0$ contains the most inactive ones.
Next, we will look at customers' purchase behaviour based on the number of times they buy from the online retail shop, that is, the total number of orders placed by each customer.
#get order counts for each user and create a dataframe with it
ctm_frequency = df_data.groupby('CustomerID').InvoiceDate.count().reset_index()
ctm_frequency.columns = ['CustomerID','Frequency']
#add this data to our main ctm_dt
ctm_dt = pd.merge(ctm_dt, ctm_frequency, on='CustomerID')
ctm_dt.head()
| CustomerID | NextPurchaseDay | Recency | RecencyCluster | Frequency | |
|---|---|---|---|---|---|
| 0 | 13085.0 | 9999.0 | 57 | 3 | 92 |
| 1 | 13078.0 | 13.0 | 0 | 3 | 855 |
| 2 | 18102.0 | 27.0 | 26 | 3 | 1068 |
| 3 | 12682.0 | 15.0 | 0 | 3 | 1042 |
| 4 | 18087.0 | 46.0 | 44 | 3 | 95 |
pd.DataFrame(ctm_dt.Frequency.describe())
| Frequency | |
|---|---|
| count | 5314.000000 |
| mean | 149.498495 |
| std | 378.177404 |
| min | 1.000000 |
| 25% | 22.000000 |
| 50% | 59.000000 |
| 75% | 158.750000 |
| max | 13097.000000 |
#plot the histogram
plot_data = [
go.Histogram(
x=ctm_dt.query('Frequency < 1200')['Frequency']
)
]
plot_layout = go.Layout(
title="Customers with Purchase Frequency less than 1200",
xaxis_title="Customer Frequency Purchase in groups of 20",
yaxis_title="Number of Customers"
)
fig = go.Figure(
data=plot_data,
layout=plot_layout
)
pyoff.iplot(fig)
kmeans = KMeans(n_clusters=number_of_clusters)
kmeans.fit(ctm_dt[['Frequency']])
ctm_dt['FrequencyCluster'] = kmeans.predict(ctm_dt[['Frequency']])
ctm_dt = order_cluster(ctm_dt, 'Frequency', 'FrequencyCluster', False)
ctm_dt.head()
| CustomerID | NextPurchaseDay | Recency | RecencyCluster | Frequency | FrequencyCluster | |
|---|---|---|---|---|---|---|
| 0 | 13085.0 | 9999.0 | 57 | 3 | 92 | 3 |
| 1 | 18087.0 | 46.0 | 44 | 3 | 95 | 3 |
| 2 | 14110.0 | 30.0 | 19 | 3 | 403 | 3 |
| 3 | 17519.0 | 116.0 | 33 | 3 | 224 | 3 |
| 4 | 12362.0 | 40.0 | 12 | 3 | 275 | 3 |
#see details of each cluster
ctm_dt.groupby('FrequencyCluster')['Frequency'].describe()
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| FrequencyCluster | ||||||||
| 0 | 2.0 | 12355.000000 | 1049.346463 | 11613.0 | 11984.0 | 12355.0 | 12726.00 | 13097.0 |
| 1 | 14.0 | 3921.000000 | 1541.056680 | 2430.0 | 2829.5 | 3346.5 | 4570.25 | 7307.0 |
| 2 | 402.0 | 718.293532 | 318.785817 | 404.0 | 484.0 | 604.5 | 849.50 | 2134.0 |
| 3 | 4896.0 | 87.025531 | 91.380843 | 1.0 | 20.0 | 51.0 | 123.00 | 403.0 |
Note that, because the frequency clusters were ordered in descending order here, cluster $0$ contains the customers with the highest purchase counts.
ctm_revenue = df_data.groupby('CustomerID').Revenue.sum().reset_index()
#merge it with our ctm_dt
ctm_dt = pd.merge(ctm_dt, ctm_revenue, on='CustomerID')
ctm_dt.head()
| CustomerID | NextPurchaseDay | Recency | RecencyCluster | Frequency | FrequencyCluster | Revenue | |
|---|---|---|---|---|---|---|---|
| 0 | 13085.0 | 9999.0 | 57 | 3 | 92 | 3 | 1459.46 |
| 1 | 18087.0 | 46.0 | 44 | 3 | 95 | 3 | 14411.62 |
| 2 | 14110.0 | 30.0 | 19 | 3 | 403 | 3 | 12957.60 |
| 3 | 17519.0 | 116.0 | 33 | 3 | 224 | 3 | 5102.80 |
| 4 | 12362.0 | 40.0 | 12 | 3 | 275 | 3 | 5284.58 |
#plot the histogram
plot_data = [
go.Histogram(
x=ctm_dt.query('Revenue < 10000')['Revenue']
)
]
plot_layout = go.Layout(
title='Customers with Monetary Value below 10000',
xaxis_title='Customers Revenue',
yaxis_title="Number of Customers"
)
fig = go.Figure(
data=plot_data,
layout=plot_layout
)
pyoff.iplot(fig)
#apply clustering
kmeans = KMeans(n_clusters=number_of_clusters)
kmeans.fit(ctm_dt[['Revenue']])
ctm_dt['RevenueCluster'] = kmeans.predict(ctm_dt[['Revenue']])
#order the cluster numbers
ctm_dt = order_cluster(ctm_dt, 'Revenue', 'RevenueCluster', True)
ctm_dt.head()
| CustomerID | NextPurchaseDay | Recency | RecencyCluster | Frequency | FrequencyCluster | Revenue | RevenueCluster | |
|---|---|---|---|---|---|---|---|---|
| 0 | 13085.0 | 9999.0 | 57 | 3 | 92 | 3 | 1459.46 | 0 |
| 1 | 18087.0 | 46.0 | 44 | 3 | 95 | 3 | 14411.62 | 0 |
| 2 | 14110.0 | 30.0 | 19 | 3 | 403 | 3 | 12957.60 | 0 |
| 3 | 17519.0 | 116.0 | 33 | 3 | 224 | 3 | 5102.80 | 0 |
| 4 | 12362.0 | 40.0 | 12 | 3 | 275 | 3 | 5284.58 | 0 |
#show details of the dataframe
ctm_dt.groupby('RevenueCluster')['Revenue'].describe()
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| RevenueCluster | ||||||||
| 0 | 5253.0 | 2090.222205 | 3302.352286 | -25111.09 | 345.0000 | 930.100 | 2351.5800 | 25774.54 |
| 1 | 51.0 | 50041.483392 | 22639.938850 | 26278.86 | 32109.8350 | 42653.141 | 57724.5300 | 113513.07 |
| 2 | 8.0 | 198033.391250 | 62010.602305 | 136391.48 | 142827.5300 | 181355.750 | 242746.6750 | 296564.69 |
| 3 | 2.0 | 560778.645000 | 52943.312094 | 523342.07 | 542060.3575 | 560778.645 | 579496.9325 | 598215.22 |
#calculate overall score and use mean() to see details
ctm_dt['OverallScore'] = ctm_dt['RecencyCluster'] + ctm_dt['FrequencyCluster'] + ctm_dt['RevenueCluster']
ctm_dt.groupby('OverallScore')[['Recency', 'Frequency', 'Revenue']].mean()
| Recency | Frequency | Revenue | |
|---|---|---|---|
| OverallScore | |||
| 3 | 514.613281 | 39.929688 | 435.112398 |
| 4 | 309.024038 | 96.556319 | 1220.047826 |
| 5 | 126.745415 | 265.268525 | 3921.574227 |
| 6 | 40.712192 | 144.792497 | 3511.843441 |
| 7 | 12.450000 | 676.900000 | 109639.778000 |
| 8 | 26.000000 | 1068.000000 | 598215.220000 |
The scoring above clearly shows that customers with an overall score of $8$ are our best customers, whereas those with a score of $3$ are our worst.
ctm_dt['Segment'] = 'Low-Value'
ctm_dt.loc[ctm_dt['OverallScore'] > 4, 'Segment'] = 'Mid-Value'
ctm_dt.loc[ctm_dt['OverallScore'] > 6, 'Segment'] = 'High-Value'
ctm_dt.head()
| CustomerID | NextPurchaseDay | Recency | RecencyCluster | Frequency | FrequencyCluster | Revenue | RevenueCluster | OverallScore | Segment | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 13085.0 | 9999.0 | 57 | 3 | 92 | 3 | 1459.46 | 0 | 6 | Mid-Value |
| 1 | 18087.0 | 46.0 | 44 | 3 | 95 | 3 | 14411.62 | 0 | 6 | Mid-Value |
| 2 | 14110.0 | 30.0 | 19 | 3 | 403 | 3 | 12957.60 | 0 | 6 | Mid-Value |
| 3 | 17519.0 | 116.0 | 33 | 3 | 224 | 3 | 5102.80 | 0 | 6 | Mid-Value |
| 4 | 12362.0 | 40.0 | 12 | 3 | 275 | 3 | 5284.58 | 0 | 6 | Mid-Value |
# A scatter plot of Revenue versus Frequency
ctm_graph = ctm_dt.query("Revenue < 50000 and Frequency < 2000")
plot_data = [
go.Scatter(
x=ctm_graph.query("Segment == 'Low-Value'")['Frequency'],
y=ctm_graph.query("Segment == 'Low-Value'")['Revenue'],
mode='markers',
name='Low-Value',
marker=dict(
size=7,
line=dict(width=1),
color='blue',
opacity=0.8
)
),
go.Scatter(
x=ctm_graph.query("Segment == 'Mid-Value'")['Frequency'],
y=ctm_graph.query("Segment == 'Mid-Value'")['Revenue'],
mode='markers',
name='Mid-Value',
marker=dict(
size=9,
line=dict(width=1),
color='green',
opacity=0.5
)
),
go.Scatter(
x=ctm_graph.query("Segment == 'High-Value'")['Frequency'],
y=ctm_graph.query("Segment == 'High-Value'")['Revenue'],
mode='markers',
name='High-Value',
marker=dict(
size=11,
line=dict(width=1),
color='red',
opacity=0.9
)
),
]
plot_layout = go.Layout(
yaxis= {'title': "Revenue"},
xaxis= {'title': "Frequency"},
title='Segments'
)
fig = go.Figure(
data=plot_data,
layout=plot_layout
)
pyoff.iplot(fig)
# Scatter plot of Revenue versus Recency
ctm_graph = ctm_dt.query("Revenue < 50000 and Frequency < 2000")
plot_data = [
go.Scatter(
x=ctm_graph.query("Segment == 'Low-Value'")['Recency'],
y=ctm_graph.query("Segment == 'Low-Value'")['Revenue'],
mode='markers',
name='Low-Value',
marker=dict(
size=7,
line=dict(width=1),
color='blue',
opacity=0.8
)
),
go.Scatter(
x=ctm_graph.query("Segment == 'Mid-Value'")['Recency'],
y=ctm_graph.query("Segment == 'Mid-Value'")['Revenue'],
mode='markers',
name='Mid-Value',
marker=dict(
size=9,
line=dict(width=1),
color='green',
opacity=0.5
)
),
go.Scatter(
x=ctm_graph.query("Segment == 'High-Value'")['Recency'],
y=ctm_graph.query("Segment == 'High-Value'")['Revenue'],
mode='markers',
name='High-Value',
marker=dict(
size=11,
line=dict(width=1),
color='red',
opacity=0.9
)
),
]
plot_layout = go.Layout(
yaxis= {'title': "Revenue"},
xaxis= {'title': "Recency"},
title='Segments'
)
fig = go.Figure(
data=plot_data,
layout=plot_layout
)
pyoff.iplot(fig)
# Scatter plot of Frequency versus Recency
ctm_graph = ctm_dt.query("Revenue < 50000 and Frequency < 2000")
plot_data = [
go.Scatter(
x=ctm_graph.query("Segment == 'Low-Value'")['Recency'],
y=ctm_graph.query("Segment == 'Low-Value'")['Frequency'],
mode='markers',
name='Low-Value',
marker= dict(
size=7,
line=dict(width=1),
color='blue',
opacity=0.8
)
),
go.Scatter(
x=ctm_graph.query("Segment == 'Mid-Value'")['Recency'],
y=ctm_graph.query("Segment == 'Mid-Value'")['Frequency'],
mode='markers',
name='Mid-Value',
marker=dict(
size=9,
line=dict(width=1),
color='green',
opacity=0.5
)
),
go.Scatter(
x=ctm_graph.query("Segment == 'High-Value'")['Recency'],
y=ctm_graph.query("Segment == 'High-Value'")['Frequency'],
mode='markers',
name='High-Value',
marker=dict(
size=11,
line=dict(width=1),
color='red',
opacity=0.9
)
),
]
plot_layout = go.Layout(
yaxis= {'title': "Frequency"},
xaxis= {'title': "Recency"},
title='Segments'
)
fig = go.Figure(
data=plot_data,
layout=plot_layout
)
pyoff.iplot(fig)
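The three segment plots above repeat the same three-trace pattern with only the axis columns changing. As a design note, the duplication could be factored into a single helper. A minimal sketch is shown below; matplotlib is used instead of plotly purely to keep the example self-contained, and the toy dataframe merely mirrors the column and segment names used above.

```python
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt

# marker style per segment, mirroring the plotly traces above
SEGMENT_STYLE = {
    'Low-Value':  dict(c='blue',  s=20, alpha=0.8),
    'Mid-Value':  dict(c='green', s=40, alpha=0.5),
    'High-Value': dict(c='red',   s=60, alpha=0.9),
}

def segment_scatter(df, x_col, y_col):
    """Draw one scatter trace per customer segment, as in the plotly figures."""
    fig, ax = plt.subplots()
    for segment, style in SEGMENT_STYLE.items():
        sub = df.query("Segment == @segment")
        ax.scatter(sub[x_col], sub[y_col], label=segment, **style)
    ax.set_xlabel(x_col)
    ax.set_ylabel(y_col)
    ax.set_title('Segments')
    ax.legend()
    return fig

# usage on toy data shaped like ctm_dt
toy = pd.DataFrame({'Recency': [5, 50, 300], 'Frequency': [700, 150, 20],
                    'Revenue': [9000, 2000, 100],
                    'Segment': ['High-Value', 'Mid-Value', 'Low-Value']})
fig = segment_scatter(toy, 'Frequency', 'Revenue')
```

With such a helper, each of the three figures above becomes a one-line call that names only the two columns to plot.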
Let us create a copy of the dataframe ctm_dt and apply the get_dummies method to it, converting the categorical column Segment into indicator variables.
#create ctm_class as a copy of ctm_dt before applying get_dummies
ctm_class = ctm_dt.copy()
ctm_class = pd.get_dummies(ctm_class)
ctm_class.head()
| CustomerID | NextPurchaseDay | Recency | RecencyCluster | Frequency | FrequencyCluster | Revenue | RevenueCluster | OverallScore | Segment_High-Value | Segment_Low-Value | Segment_Mid-Value | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 13085.0 | 9999.0 | 57 | 3 | 92 | 3 | 1459.46 | 0 | 6 | 0 | 0 | 1 |
| 1 | 18087.0 | 46.0 | 44 | 3 | 95 | 3 | 14411.62 | 0 | 6 | 0 | 0 | 1 |
| 2 | 14110.0 | 30.0 | 19 | 3 | 403 | 3 | 12957.60 | 0 | 6 | 0 | 0 | 1 |
| 3 | 17519.0 | 116.0 | 33 | 3 | 224 | 3 | 5102.80 | 0 | 6 | 0 | 0 | 1 |
| 4 | 12362.0 | 40.0 | 12 | 3 | 275 | 3 | 5284.58 | 0 | 6 | 0 | 0 | 1 |
Since our goal is to estimate whether a customer will make a purchase in the next quarter, we will create a new column NextPurchaseDayRange whose value is $1$ if the customer purchases within the next $90$ days and $0$ otherwise:
ctm_class['NextPurchaseDayRange'] = 1 ## less than 3 months
ctm_class.loc[ctm_class.NextPurchaseDay>90,'NextPurchaseDayRange'] = 0 # more than 3 months
ctm_class.head()
| CustomerID | NextPurchaseDay | Recency | RecencyCluster | Frequency | FrequencyCluster | Revenue | RevenueCluster | OverallScore | Segment_High-Value | Segment_Low-Value | Segment_Mid-Value | NextPurchaseDayRange | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 13085.0 | 9999.0 | 57 | 3 | 92 | 3 | 1459.46 | 0 | 6 | 0 | 0 | 1 | 0 |
| 1 | 18087.0 | 46.0 | 44 | 3 | 95 | 3 | 14411.62 | 0 | 6 | 0 | 0 | 1 | 1 |
| 2 | 14110.0 | 30.0 | 19 | 3 | 403 | 3 | 12957.60 | 0 | 6 | 0 | 0 | 1 | 1 |
| 3 | 17519.0 | 116.0 | 33 | 3 | 224 | 3 | 5102.80 | 0 | 6 | 0 | 0 | 1 | 0 |
| 4 | 12362.0 | 40.0 | 12 | 3 | 275 | 3 | 5284.58 | 0 | 6 | 0 | 0 | 1 | 1 |
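On toy values the cut-off behaves as follows. Note that the $9999$ sentinel (used earlier for customers with no observed next purchase, as in row 0 above) correctly falls into the "more than 3 months" class:

```python
import pandas as pd

# NextPurchaseDay values taken from the first rows of ctm_class above
toy = pd.DataFrame({'NextPurchaseDay': [9999.0, 46.0, 30.0, 116.0, 40.0]})
toy['NextPurchaseDayRange'] = 1                                # <= 90 days
toy.loc[toy.NextPurchaseDay > 90, 'NextPurchaseDayRange'] = 0  # > 90 days
# the 9999 sentinel and 116 are labelled 0; 46, 30 and 40 are labelled 1
```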
Finally in this section, let's see the correlation between our features and the label. We can achieve this by applying the corr method to the dataframe ctm_class.
corr_matrix = ctm_class[ctm_class.columns].corr()
corr_df = pd.DataFrame(corr_matrix.min())
corr_df.columns = ['MinCorrelationCoeff']
corr_df['MaxCorrelationCoeff'] = corr_matrix[corr_matrix < 1].max()
corr_df
| MinCorrelationCoeff | MaxCorrelationCoeff | |
|---|---|---|
| CustomerID | -0.043179 | 0.040509 |
| NextPurchaseDay | -0.569494 | 0.451357 |
| Recency | -0.964632 | 0.857126 |
| RecencyCluster | -0.964632 | 0.968062 |
| Frequency | -0.704553 | 0.476721 |
| FrequencyCluster | -0.704553 | 0.270921 |
| Revenue | -0.378214 | 0.824148 |
| RevenueCluster | -0.338447 | 0.824148 |
| OverallScore | -0.932148 | 0.968062 |
| Segment_High-Value | -0.075834 | 0.671518 |
| Segment_Low-Value | -0.991924 | 0.857126 |
| Segment_Mid-Value | -0.991924 | 0.882628 |
| NextPurchaseDayRange | -0.569494 | 0.516154 |
From the output above, we observe that OverallScore has the highest positive correlation of $0.97$ (with RecencyCluster), and Segment_Low-Value has the strongest negative correlation of $-0.99$ (with Segment_Mid-Value).
We can get a good visualisation of the coefficient matrix below.
plt.figure(figsize = (40, 30))
sns.heatmap(corr_matrix, annot = True, linewidths=0.2, fmt=".2f");
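The min/max table above relies on one small trick worth spelling out: the mask `corr_matrix[corr_matrix < 1]` turns each column's perfect self-correlation into NaN, so `.max()` returns the strongest correlation with a *different* column. A toy illustration (the column names here are arbitrary):

```python
import pandas as pd

toy = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [1, 2, 4, 8], 'c': [4, 3, 2, 1]})
corr_matrix = toy.corr()
# comparing with 1 masks each column's self-correlation to NaN,
# so .max() picks the strongest correlation with a different column
corr_df = pd.DataFrame(corr_matrix.min())
corr_df.columns = ['MinCorrelationCoeff']
corr_df['MaxCorrelationCoeff'] = corr_matrix[corr_matrix < 1].max()
# column c is exactly 5 - a, so a's minimum coefficient is -1;
# a's maximum is its (imperfect) correlation with b, strictly below 1
```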
ctm_class = ctm_class.drop('NextPurchaseDay', axis=1)
X, y = ctm_class.drop('NextPurchaseDayRange', axis=1), ctm_class.NextPurchaseDayRange
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=44)
# Create an array of models
models = []
models.append(("LogisticRegression", LogisticRegression()))
models.append(("GaussianNB", GaussianNB()))
models.append(("RandomForestClassifier", RandomForestClassifier()))
models.append(("SVC", SVC()))
models.append(("DecisionTreeClassifier", DecisionTreeClassifier()))
models.append(("xgb.XGBClassifier", xgb.XGBClassifier(eval_metric='mlogloss')))
models.append(("KNeighborsClassifier", KNeighborsClassifier()))
# Measuring the accuracy of the models
for model_abbrv, model in models:
trained_model = model.fit(X_train, y_train)
print('Accuracy of {} on training set: {:.2f}'.format(model_abbrv, trained_model.score(X_train, y_train)))
print('Accuracy of {} on test set: {:.2f}'.format(model_abbrv, trained_model.score(X_test[X_train.columns], y_test)))
print()
Accuracy of LogisticRegression on training set: 0.91
Accuracy of LogisticRegression on test set: 0.90
Accuracy of GaussianNB on training set: 0.89
Accuracy of GaussianNB on test set: 0.89
Accuracy of RandomForestClassifier on training set: 1.00
Accuracy of RandomForestClassifier on test set: 0.91
Accuracy of SVC on training set: 0.84
Accuracy of SVC on test set: 0.85
Accuracy of DecisionTreeClassifier on training set: 1.00
Accuracy of DecisionTreeClassifier on test set: 0.88
Accuracy of xgb.XGBClassifier on training set: 1.00
Accuracy of xgb.XGBClassifier on test set: 0.90
Accuracy of KNeighborsClassifier on training set: 0.89
Accuracy of KNeighborsClassifier on test set: 0.85
From these results, RandomForestClassifier attains the highest test accuracy ($0.91$), but LogisticRegression comes close ($0.90$) and, unlike the tree-based models, shows no gap between its training and test scores, so we take it as our baseline.
log_reg_model = LogisticRegression().fit(X_train, y_train)
print('Accuracy of LogisticRegression on training set: {:.2f}'.format(log_reg_model.score(X_train, y_train)))
print('Accuracy of LogisticRegression on test set: {:.2f}'.format(log_reg_model.score(X_test[X_train.columns], y_test)))
Accuracy of LogisticRegression on training set: 0.91
Accuracy of LogisticRegression on test set: 0.90
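One caveat: a single train/test split gives a noisy accuracy estimate. The KFold and cross_val_score utilities imported at the top of the notebook could be used for a steadier picture. A hedged sketch on synthetic data (X_demo and y_demo stand in for the real X and y from ctm_class):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# synthetic stand-in for (X, y); replace with the real features from ctm_class
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(200, 4))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# 5-fold cross-validated accuracy instead of a single split
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(), X_demo, y_demo,
                         cv=cv, scoring='accuracy')
mean_acc = scores.mean()
```

Reporting the mean and standard deviation of `scores` would make the model comparison above more robust than the single-split numbers.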
Let's see whether we can improve the accuracy of the XGB classifier by finding suitable hyperparameters, and then check whether the tuned model outperforms LogisticRegression.
The untuned XGB classifier gives us the following:
xgb_model = xgb.XGBClassifier(eval_metric='logloss').fit(X_train, y_train)
print('Accuracy of XGB classifier on training set: {:.2f}'.format(xgb_model.score(X_train, y_train)))
print('Accuracy of XGB classifier on test set: {:.2f}'.format(xgb_model.score(X_test[X_train.columns], y_test)))
Accuracy of XGB classifier on training set: 1.00
Accuracy of XGB classifier on test set: 0.90
parameter = {
'max_depth':range(3,10,2),
'min_child_weight':range(1,5,2)
}
p_grid_search = GridSearchCV(
    estimator=xgb.XGBClassifier(eval_metric='mlogloss'),
    param_grid=parameter,
    scoring='accuracy',
    n_jobs=-1
)
p_grid_search.fit(X_train, y_train)
p_grid_search.best_params_, p_grid_search.best_score_
({'max_depth': 3, 'min_child_weight': 1}, 0.8993203843229418)
xgb_model = xgb.XGBClassifier(
    eval_metric='logloss',
    # start from the grid-search optimum, then regularise further by hand:
    # a shallower tree and a larger min_child_weight reduce overfitting
    max_depth=p_grid_search.best_params_['max_depth'] - 2,
    min_child_weight=p_grid_search.best_params_['min_child_weight'] + 6
).fit(X_train, y_train)
print('Accuracy of XGB classifier on training set: {:.2f}'.format(xgb_model.score(X_train, y_train)))
print('Accuracy of XGB classifier on test set: {:.2f}'.format(xgb_model.score(X_test[X_train.columns], y_test)))
Accuracy of XGB classifier on training set: 0.92
Accuracy of XGB classifier on test set: 0.92
As we can see, the improved XGB classifier model is more accurate than the LogisticRegression model by $0.02$ on the test set.
Let us go ahead and predict with these two models.
xgb_pred_y = xgb_model.predict(X_test)
log_reg_pred_y = log_reg_model.predict(X_test)
Let's compute the confusion matrices of these two models with the user-defined function get_confusion_matrix defined below.
def get_confusion_matrix(y_test, y_pred):
"""
Displays the confusion matrix of the input numpy arrays y_test and y_pred.
INPUT:
y_test - A numpy array
y_pred - A numpy array
OUTPUT:
NoneType
"""
data = {'y_Actual': y_test, 'y_Predicted': y_pred}
df = pd.DataFrame(data, columns=['y_Actual','y_Predicted'])
conf_matrix = pd.crosstab(df['y_Actual'], df['y_Predicted'],
rownames=['Actual'],
colnames=['Predicted'])
sns.heatmap(conf_matrix, annot=True, fmt = "d", cmap="Spectral")
plt.show()
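The heatmap only visualises counts that can also be read off directly. The confusion_matrix function from sklearn.metrics, already imported at the top of the notebook, returns the same counts as an array, which is handy when a figure is not needed. A small sketch on toy labels:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# toy labels standing in for y_test and a model's predictions
y_true = np.array([1, 1, 0, 1, 0, 1])
y_pred = np.array([1, 0, 0, 1, 1, 1])

cm = confusion_matrix(y_true, y_pred)
# for binary labels, ravel() yields the four cells in a fixed order
tn, fp, fn, tp = cm.ravel()
accuracy = (tp + tn) / cm.sum()
```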
Let's get the confusion matrix for y_test and xgb_pred_y (the improved XGB classifier) and for y_test and log_reg_pred_y (the LogisticRegression model).
get_confusion_matrix(np.array(y_test), xgb_pred_y)
get_confusion_matrix(np.array(y_test), log_reg_pred_y)
Since we are concerned with the accuracy of our model, we choose the improved XGB classifier over the LogisticRegression model.
In summary I observed the following:
From the dataset, the online retail shop has 5942 customers from 41 different countries.
The top 3 countries are the United Kingdom, Germany and France, with the United Kingdom alone accounting for $90.85\%$ of the customers of the online retail shop.
The online retail shop made the highest sales in November 2010, followed by November 2011. Also, there was a rise in monthly sales after August.
The online retail shop earned most of its revenue from customers residing in the United Kingdom, followed by the Republic of Ireland, the Netherlands, Germany and France.
Finally, we introduced new features into our dataframe ctm_dt and built a machine learning model to help us decide whether a customer will buy something again from the online shop in the next quarter, i.e., within $90$ days of his or her last purchase. Among the models we tried, we tuned the hyperparameters of the XGB classifier so that it outperformed the LogisticRegression model.